Add EPYC CPU serving skill (vLLM + zentorch) by amd-lalithnc · Pull Request #76 · amd/skills

amd-lalithnc · 2026-06-24T12:17:00Z

What

Adds serving-llms-on-epyc: a skill that brings up a single vLLM OpenAI endpoint on an
AMD EPYC CPU host with the zentorch backend, in a container (Docker/Podman) or a conda env.

Flow

Detect the CPU: vendor, EPYC generation + Zen arch, AVX-512, physical cores, NUMA, RAM (detect.py).
Validate the environment (validate.py): container runtime (docker/podman) or conda fallback;
image present, and if already pulled, import vllm, zentorch inside it; host perf libraries
(tcmalloc / OpenMP via LD_PRELOAD); HF_TOKEN; RAM.
Resolve + check the model (check_model.py): confirm vLLM supports the architecture via its
model registry (text or multimodal); reject pooling and other non-LLM models, which do not
expose a chat endpoint. Gated models require HF_TOKEN + license acceptance.
Check RAM fit (estimate_memory.py): weights + KV cache + headroom ≤ host RAM.
Size the runtime from the hardware (cpu_tune.py): bind to socket 0's physical cores and
set VLLM_CPU_KVCACHE_SPACE; no memory binding by default (NPS2/NPS4 get a perf note).
Confirm: present a sized plan and wait for the user to confirm before launching.
Launch: vllm serve (never --device cpu on vLLM ≥ 0.20).
Verify + hand over: poll /health, validate the /v1/chat/completions endpoint, then print a
connection table.
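The RAM-fit step in the flow above comes down to one inequality: weights + KV cache + headroom ≤ host RAM. The following is a minimal sketch of that arithmetic, not the contents of estimate_memory.py. The function name, the `FitResult` type, and the 8 GiB default headroom are assumptions made for illustration.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class FitResult:
    weights_gib: float
    kv_cache_gib: float
    headroom_gib: float
    host_ram_gib: float

    @property
    def required_gib(self) -> float:
        # Left-hand side of the check: weights + KV cache + headroom.
        return self.weights_gib + self.kv_cache_gib + self.headroom_gib

    @property
    def fits(self) -> bool:
        return self.required_gib <= self.host_ram_gib


def estimate_fit(num_params: int, bytes_per_param: int, kv_cache_gib: float,
                 host_ram_gib: float, headroom_gib: float = 8.0) -> FitResult:
    """Estimate whether a model fits in RAM (hypothetical helper).

    Weights are approximated as parameter count times bytes per parameter
    (2 for bf16). The headroom default is an illustrative assumption.
    """
    weights_gib = num_params * bytes_per_param / GIB
    return FitResult(weights_gib, kv_cache_gib, headroom_gib, host_ram_gib)
```

For example, an 8B-parameter bf16 model with a 32 GiB KV cache needs roughly 55 GiB under these assumptions, so `estimate_fit(8_000_000_000, 2, 32, 128).fits` is true on a 128 GiB host.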

Single instance. On any failure it reports the cause and the logs, then stops: no retry, no debugging loop.
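The verify step (poll /health until the server is up, then give up cleanly) could look like the sketch below. It uses only the standard library. The function names, the 600-second deadline, and the 5-second interval are assumptions, not values taken from the skill. The probe and sleep functions are injectable so the loop can be exercised without a running server.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not healthy yet.
        return False


def wait_for_health(base_url: str, deadline_s: float = 600.0,
                    interval_s: float = 5.0,
                    probe: Callable[[str], bool] = http_ok,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll <base_url>/health until it succeeds or the deadline passes.

    Returns False on timeout so the caller can report the cause and stop,
    rather than entering a retry or debugging loop.
    """
    waited = 0.0
    while True:
        if probe(f"{base_url}/health"):
            return True
        if waited >= deadline_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

A caller would run `wait_for_health("http://localhost:8000")` after launch (the port is an assumption) and only then send a test request to /v1/chat/completions.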

Notes / scope

Uses the amdih/zendnn_zentorch image on Docker Hub.
KV cache is bf16-only on zentorch CPU; TORCHINDUCTOR_FREEZING=1 requires VLLM_USE_AOT_COMPILE=0.
OMP_NUM_THREADS and VLLM_CPU_NUM_OF_RESERVED_CPU are intentionally left unset — vLLM derives
them (from the bind list / its own default).
NUMA default: socket 0's physical cores, no memory binding.
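As an illustration of the notes above, the environment for a launch might be assembled as in the sketch below. This is not cpu_tune.py. The helper name is hypothetical, and using VLLM_CPU_OMP_THREADS_BIND to carry the bind list is an assumption, since the description only refers to "the bind list" without naming the variable.

```python
def build_serving_env(bind_cpus: str, kv_cache_gib: int,
                      freezing: bool = True) -> dict[str, str]:
    """Build environment variables for a zentorch CPU launch (hypothetical helper).

    bind_cpus is a CPU list such as "0-63"; kv_cache_gib is the KV cache
    budget in GiB.
    """
    env = {
        "VLLM_CPU_KVCACHE_SPACE": str(kv_cache_gib),
        # Assumption: the bind list is passed through this vLLM variable.
        "VLLM_CPU_OMP_THREADS_BIND": bind_cpus,
    }
    if freezing:
        # Per the notes: freezing requires AOT compile to be switched off.
        env["TORCHINDUCTOR_FREEZING"] = "1"
        env["VLLM_USE_AOT_COMPILE"] = "0"
    # OMP_NUM_THREADS and VLLM_CPU_NUM_OF_RESERVED_CPU are deliberately
    # not set here, so that vLLM derives them itself.
    return env
```

Coupling the two compile flags in one branch keeps the constraint from the notes in a single place, so one cannot be set without the other.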

Testing

Structural gate (check.sh): passes (0 errors).
Behavioral eval (LLM-judged, sonnet): 13/13 — exercises detect → validate → check_model →
estimate → cpu_tune → confirm, plus the guardrails. Live launch/serve is the manual /
integration tier on a real EPYC host.

Signed-off-by: Lalithnarayan C <Lalithnarayan.C@amd.com>
Change-Id: I1dc2362e0983326658b6618015a161ecd44f40e6

danielholanda · 2026-06-25T22:24:52Z

@Mahdi-CV Can you help review this?

amd-lalithnc · 2026-06-30T09:24:31Z

hi @danielholanda @Mahdi-CV @shailensobhee, can we move ahead with review and CI for this PR? thanks!

shailensobhee · 2026-06-30T09:56:28Z

Hi @amd-lalithnc, a few things so far. Having benchmarked ZenDNN 6.0 recently, I noticed a versioning issue that you may want to clarify in the SKILL itself:
It looks like you can use vLLM 0.23.0 only if you are in a conda env and use the zentorch 2.11 wheel file. If you go the docker/podman route, it's vLLM 0.22.0, which is the latest version listed on Docker Hub (https://hub.docker.com/r/amdih/zendnn_zentorch).

Do you agree with this observation? If yes, we may need to clarify this in the SKILL file and associated documentation.

If you have a dual-socket machine, how do you specify that only socket 1 (as opposed to socket 0) should be used? What if the system's socket 0 is already busy? With this skill, it appears that we will force the use of socket 0 even if socket 1 is idle.
You seem to size the KV cache from the whole system's RAM, but since you are binding to socket 0, maybe you'd want to do memory binding too? Otherwise you'd hit severe performance issues when accessing KV-cache data allocated in memory attached to socket 1.

Conclusion: We can merge this skill, but there are potential performance aspects to narrow down. Thoughts?

cc: @Vkathail


amd-lalithnc · 2026-06-30T14:48:06Z

hi @shailensobhee, thanks for your thoughts!

0.23.0 is in the validation phase, with an expected release date of 8 July. The current supported open-source version is 0.22.0.
We have added logic to default to socket 0; if it is busy, move to the other socket if available; and if both are busy, proceed with socket 0 with a warning.
We have added limited memory binding, confining allocation to the memory available on the chosen socket. When a socket has multiple NUMA nodes, a warning message is displayed.

Let me know if the changes are suitable. Thanks!

shailensobhee

Approving. All three points I raised earlier have been addressed in the code, verified against the head commit:

vLLM version clarity - data/epyc.json pins vllm_version: 0.22.0 with the matching public container tag (amdih/zendnn_zentorch:vllm_v0.22.0_zentorch_v2.11.0.1_...). Pinning the public stable 0.22.0 is correct while 0.23.0 is still in validation. The 0.23 / zentorch 2.11 TORCHINDUCTOR_FREEZING crash gotcha is documented.
Dual-socket selection - cpu_tune.py now samples per-socket load from /proc/stat, prefers a free socket, falls back to the least-busy one with a warning when both are busy, and supports --socket N to force.
KV-cache locality - memory is now bound to the chosen socket (numactl --cpunodebind/--membind for conda, --cpuset-mems for containers) and KV cache is sized from that socket's local RAM, not whole-system RAM. NPS2/NPS4 multi-node cases emit a note.
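The socket selection and binding described in the points above can be sketched as follows. This is a simplified illustration, not the code in cpu_tune.py. The function names, the 20% busy threshold, and the assumption that each socket maps to exactly one NUMA node (NPS1) are all hypothetical.

```python
def parse_proc_stat(text: str) -> dict[int, tuple[int, int]]:
    """Map CPU id -> (busy_ticks, total_ticks) from /proc/stat content."""
    out = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep per-CPU lines ("cpu0", "cpu1", ...); skip the aggregate "cpu".
        if not parts or not parts[0].startswith("cpu") or parts[0] == "cpu":
            continue
        # Fields: user nice system idle iowait irq softirq steal
        ticks = [int(x) for x in parts[1:9]]
        idle = ticks[3] + ticks[4]
        total = sum(ticks)
        out[int(parts[0][3:])] = (total - idle, total)
    return out


def socket_load(before, after, socket_cpus):
    """Busy fraction per socket between two /proc/stat samples."""
    loads = {}
    for sock, cpus in socket_cpus.items():
        busy = sum(after[c][0] - before[c][0] for c in cpus)
        total = sum(after[c][1] - before[c][1] for c in cpus)
        loads[sock] = busy / total if total else 0.0
    return loads


def pick_socket(loads, busy_threshold=0.2, forced=None):
    """Return (socket, warning). forced mimics a --socket N override."""
    if forced is not None:
        return forced, None
    free = sorted(s for s, load in loads.items() if load < busy_threshold)
    if free:
        return free[0], None
    least = min(loads, key=lambda s: (loads[s], s))
    return least, f"all sockets busy; using least-busy socket {least}"


def numactl_prefix(node: int) -> list[str]:
    """Conda path: bind both CPUs and memory to one NUMA node."""
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}"]


def container_cpuset_args(cpus: str, node: int) -> list[str]:
    """Container path: equivalent docker/podman cpuset flags."""
    return ["--cpuset-cpus", cpus, "--cpuset-mems", str(node)]
```

In use, two samples of /proc/stat taken a short interval apart are parsed, `socket_load` turns them into per-socket busy fractions, and the chosen socket feeds either `numactl_prefix` or `container_cpuset_args`. On NPS2/NPS4 systems a socket spans several NUMA nodes, so the one-node-per-socket assumption here would not hold.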

Note on CI: the behavioral checks are red due to a CI infra issue, not the skill. The eval harness fails at setup in conftest.py because the runner's claude judge CLI is not authenticated (Not logged in / Please run /login), so zero behavioral assertions actually executed. This is expected for a fork PR where Actions secrets are withheld. All substantive checks pass: skill validation, manifest validation, SkillSpector security scan, and external-reference checks. Recommend a maintainer with CI-secret access re-run the behavioral job (or run it from an in-repo branch) to get a clean green before merge.

add serving-llms-on-epyc skill (vLLM + zentorch CPU serving)

f62dd74

Signed-off-by: Lalithnarayan C <Lalithnarayan.C@amd.com>
Change-Id: I1dc2362e0983326658b6618015a161ecd44f40e6

danielholanda requested a review from Mahdi-CV June 25, 2026 22:24

danielholanda requested a review from shailensobhee June 29, 2026 21:57

address review comments

8aff564

Signed-off-by: Lalithnarayan C <Lalithnarayan.C@amd.com>
Change-Id: I6442cc19df3caa3e0e5f36cc276bf94550d5a95e

Merge branch 'main' into add-serving-llms-on-epyc

9a6d185

shailensobhee approved these changes Jun 30, 2026


shailensobhee merged commit a8fb081 into amd:main Jun 30, 2026
14 of 17 checks passed

amd-lalithnc mentioned this pull request Jul 1, 2026

Add serving-llms-on-epyc walkthrough #83

Open

shailensobhee merged 3 commits into amd:main from amd-lalithnc:add-serving-llms-on-epyc